A Ruled-Based Part of Speech (RPOS) Tagger for Malay Text Articles

نویسندگان

Rayner Alfred

Adam Mujat

Joe Henry Obit

چکیده

Abstrak The Malay language is an Austronesian language spoken in most countries in the South East Asia region that includes Malaysia, Indonesia, Singapore, Brunei and Thailand. Traditional linguistics is well developed for Malay but there are very limited resources and tools that are available or made accessible for computer linguistic analysis of Malay language. Assigning part of speech (POS) to running words in a sentence for Malay language is one of the pipeline processes in Natural Language Processing (NLP) tasks and it is not well investigated. This paper outlines an approach to perform the Part of Speech (POS) tagging for Malay text articles. We apply a simple Rule-based Part of Speech (RPOS) tagger to perform the tagging operation on Malay text articles. POS tagging can be described as a task of performing automatic annotation of syntactic categories for each word in a text document. A rule-based POS tagger generally involves a POS tag dictionary and a set of rules in order to identify the words that are considered parts of speech. In this paper, we propose a framework that applies Malay affixing rules to identify the Malay POS tag and the relation between words in order to select the best POS tag for words that have two or more valid POS tags. The results show that the performance accuracy of the ruled-based POS tagger is higher compared to a statistical POS tagger. This indicates that the proposed RPOS tagger is able to predict any unknown word's POS at some promising accuracy.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Part of Speech Tagger for Malay Language Based on Words Morphology

PART OF SPEECH TAGGER FOR MALAY LANGUAGE BASED ON WORDS MORPHOLOGY Mohd Pouzi Hamzah, Syarifah fatem Na’imah Binti Syed Kamaruddin School of Informatics and Applied Mathematics, Universiti Malaysia Terengganu, 21030 Kuala Terengganu, Terengganu, Malaysia Email: [email protected], [email protected] ABSTRACT : Part of Speech (POS) tagging is an essential task in pre-processing for text process...

متن کامل

Open Source Corpus Analysis Tools for Malay

Tokenisers, lemmatisers and POS taggers are vital to the linguistic and digital furtherment of any language. In this paper, we present an open source toolkit for Malay incorporating a word and sentence tokeniser, a lemmatiser and a partial POS tagger, based on heavy reuse of pre-existing language resources. We outline the software architecture of each component, and present an evaluation of eac...

متن کامل

Brill’s rule-based PoS tagger

Eric Brill introduced a PoS tagger in 1992 that was based on rules, or transformations as he calls them, where the grammar is induced directly from the training corpus without human intervention or expert knowledge. The only additional component necessary is a small, manually and correctly annotated corpus the training corpus which serves as input to the tagger. The system is then able to deriv...

متن کامل

Studying impressive parameters on the performance of Persian probabilistic context free grammar parser

In linguistics, a tree bank is a parsed text corpus that annotates syntactic or semantic sentence structure. The exploitation of tree bank data has been important ever since the first large-scale tree bank, The Penn Treebank, was published. However, although originating in computational linguistics, the value of tree bank is becoming more widely appreciated in linguistics research as a whole. F...

متن کامل

Part-Of-Speech Tagging With Neural Networks

Text corpora which are tagged with part-of-speech information are useful in many areas of linguistic research. In this paper, a new part-of-speech tagging method hased on neural networks (Net-Tagger) is presented and its performance is compared to that of a llMM-tagger (Cutting et al., 1992) and a trigrambased tagger (Kempe, 1993). It is shown that the Net-Tagger performs as well as the trigram...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2013

A Ruled-Based Part of Speech (RPOS) Tagger for Malay Text Articles

نویسندگان

چکیده

منابع مشابه

Part of Speech Tagger for Malay Language Based on Words Morphology

Open Source Corpus Analysis Tools for Malay

Brill’s rule-based PoS tagger

Studying impressive parameters on the performance of Persian probabilistic context free grammar parser

Part-Of-Speech Tagging With Neural Networks

عنوان ژورنال:

اشتراک گذاری